ABSTRACT
Motivation: 

Annotations are a key feature of many biological databases, used 
to convey our knowledge of a sequence to the reader. Ideally, 
annotations are curated manually; however, manual curation is costly,
time consuming and requires expert knowledge and training. Given
these issues and the exponential increase of data, many databases
implement automated annotation pipelines in an attempt to avoid
unannotated entries. Both manual and automated annotations vary in
quality between databases and annotators, making assessment of
annotation reliability problematic for users. The community lacks a
generic measure for determining annotation quality and correctness,
a gap we address in this paper. Specifically, we
investigate word reuse within bulk textual annotations and relate this
to Zipf's Principle of Least Effort. We use UniProtKB as a case study
to demonstrate this approach since it allows us to compare annotation 
change, both over time and between automated and manually-curated 
annotations. 
Results: 

By applying power-law distributions to word reuse in annotation, we 
show clear trends in UniProtKB over time, which are consistent with 
existing studies of quality on free text English. Further, we show a clear 
distinction between manual and automated analysis and investigate 
cohorts of protein records as they mature. These results suggest 
that this approach holds distinct promise as a mechanism for judging 
annotation quality. 
Availability: 

Source code and supplementary data are available at the authors'
website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation/
Contact: phillip.lord@newcastle.ac.uk 



1 INTRODUCTION 

A key descriptive feature of biological data is its annotation: a textual 
representation of the biology associated with the data. Biologists use 
these annotations to understand and contextualise data in biological 
sequence databases. Annotations play an essential role in describing 
and developing the users' knowledge of a given sequence and can 
form the foundation for further research (Jones et al., 2007). Some
annotation is structured, for example, using an ontology (Stevens
and Lord, 2009) or keyword list. However, free text annotation often
contains the richest biological knowledge (Camon et al., 2005), but
while free text is appropriate for human comprehension it is difficult
to interpret computationally.

The current 'gold standard' for annotation is a set of reviewed
and manually-curated entries (Curwen et al., 2004). However,
manually-curated annotation is labour-intensive, time consuming
and costly. To cope with the amount of data, which is
typically increasing exponentially, many resources and projects
generate annotations computationally (Boeckmann et al., 2003).
Automated annotations are more prone to errors than their manual
counterparts (Gilks et al., 2002), with several studies suggesting
high levels of misannotation in automated annotation (Schnoes
et al., 2009; Andorf et al., 2007; Jones et al., 2007). It can be
hard, even impossible, to determine the source from which an error
has propagated (Buza et al., 2008), causing significant problems for
biologists. Annotation quality is not consistent across all databases
and annotators (Dolan et al., 2005), whether curated manually or
automatically. It can, therefore, be difficult to determine the level
of quality, maturity or correctness of a given textual annotation.
However, users often incorrectly assume that annotations are of
consistent quality and correctness (Ussery and Hallin, 2004).

Currently there are few standard metrics for assessing annotation
quality. Annotations are frequently assigned a score, using a variety
of methods. These approaches include assigning confidence scores to
annotations based on their stability (Gross et al., 2009) or combining
the breadth (coverage of gene product) and the depth (level of
detail) for the terms in the Gene Ontology (GO) (Buza et al., 2008).
However, while deeper nodes within an ontology are generally more
specialised, these measures are problematic: first, GO has three root
domains; second, an ontology such as GO is a graph, not a tree,
so depth is not necessarily meaningful. Other methods (Rogers
and Ben-Hur, 2009; Buza et al., 2008; Pal and Eisenberg, 2005) use
evidence codes as a basis for an annotation's reliability, although
ironically, the Gene Ontology annotation manual explicitly states
that evidence codes should NOT be used in this way (GO Consortium,
2011), since they describe the type of evidence rather than its strength.

All of these approaches rely upon additional information to 
determine annotation quality. Resources, such as sequence databases, 
vary in their structures and in the additional information stored. For 
example, not all resources use evidence codes and these codes are not 
comparable between resources (Lord et al., 2003); likewise, it is not
generally possible to use methods based on an ontological hierarchy 
for non-ontological resources. 

Most resources carry some annotations which are unstructured, 
free text. Therefore, a quality metric that can be derived purely 
from textual annotation would potentially allow any resource to be 
analysed and scored. There are various measures for analysing the 
quality of text, such as the Flesch-Kincaid Readability Test (Flesch,
1948) and SMOG Grading (Laughlin, 1969). These metrics are
generally based around readability, or reading-age; that is, the literary
quality of the text, rather than correctness and quality of the subject
matter.

In this paper, we report on a bulk analysis of textual annotation 
in UniProtKB, attempting to understand whether we can exploit our 
knowledge of changes in the annotation over time as a mechanism 
for developing a quality measure of biological correctness. We 
investigate word occurrences, and their changes over time, as 
reflected in their distribution; we show that using these relationships 
we are able to detect large-scale changes in the annotation process; 
and we demonstrate that the parameters of these relationships also 
change. Specifically, we fit a power-law distribution to the extracted
word occurrences and extract a value, called α. We relate this α value
to Zipf's principle of least effort, which states that it is human nature to
take the path of least effort to achieve a goal. Broadly, higher values
of α indicate a resource which is easier for the reader, while lower
values are easier for the annotator.

While manually curated annotation is generally accepted as the
most accurate (Curwen et al., 2004), a significant problem is the lack
of more explicit gold standard data sets (James et al., 2011; Roberts,
2011). This makes defining a quality measure somewhat troublesome.
Our investigation into whether changes in word distribution are
representative of quality and maturity in the annotation shows that
measures of this form can detect large-scale features of annotation,
and that clear trends appear as UniProtKB matures and grows over 
time. These trends are often reflective of our a priori judgements 
of quality within UniProtKB, for example, a distinction between 
manual and automated annotation. In the absence of a gold standard, 
we believe that this represents reasonable evidence that this form 
of analysis may be used as the basis for a quality metric for textual 
annotation. 

2 METHODS 

2.1 Data extraction from UniProtKB 

The UniProt KnowledgeBase (UniProtKB) (UniProt Consortium, 2010)
consists of two sections: UniProtKB/Swiss-Prot, which is reviewed and
manually annotated, and UniProtKB/TrEMBL, which is unreviewed and
automatically annotated. The first version of Swiss-Prot was released in 1986,
with TrEMBL appearing in 1996. Releases of TrEMBL were initially more
frequent than Swiss-Prot, meaning the databases were released independently.
In Table 1 we map between each Swiss-Prot release and the nearest version of
TrEMBL. This allows us to compare the quality of manually and automatically
curated annotation at similar points in time.
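The nearest-release mapping can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `nearest_release` is hypothetical, only three releases per database are included (with dates taken from Table 1), and the day of the month is fixed to 1 because only month and year are reported.

```python
from datetime import date

# Release months copied from Table 1. The day is set to 1 as an assumption,
# since only month and year are given. Only a few releases are listed.
SWISS_PROT = {34: date(1996, 10, 1), 35: date(1997, 11, 1), 36: date(1998, 7, 1)}
TREMBL = {1: date(1996, 11, 1), 5: date(1998, 1, 1), 7: date(1998, 8, 1)}


def nearest_release(target, candidates):
    """Return the version in `candidates` whose release date is closest to `target`."""
    return min(candidates, key=lambda version: abs((candidates[version] - target).days))


# Pair every Swiss-Prot release with the closest TrEMBL release.
mapping = {sp: nearest_release(day, TREMBL) for sp, day in SWISS_PROT.items()}
print(mapping)
```

For these three releases the sketch reproduces the corresponding rows of Table 1.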

Following the formation of the UniProt Consortium in 2002, the releases 
of the two databases were synchronised (from 2004). The correct names are 
now technically UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. We will 
use the following naming approach for clarity: 

• UniProt - Refers to the UniProt Consortium. 

• Swiss-Prot - Refers to Swiss-Prot entries prior to the formation of the 
UniProt Consortium. 

• TrEMBL - Refers to TrEMBL entries prior to the formation of the 
UniProt Consortium. 

• UniProtKB - Refers to the combination of both Swiss-Prot and 
TrEMBL datasets. 



Table 1. Mapping between TrEMBL and Swiss-Prot release dates. For each
version of Swiss-Prot, we have associated the nearest version of TrEMBL
based on release date.

Date      Swiss-Prot version    Date      TrEMBL version
Oct-96    34                    Nov-96    1
Nov-97    35                    Jan-98    5
Jul-98    36                    Aug-98    7
Dec-98    37                    Jan-99    9
Jul-99    38                    Aug-99    11
May-00    39                    May-00    13
Oct-01    40                    Oct-01    18
Feb-03    41                    Mar-03    23
Oct-03    42                    Oct-03    25
Mar-04    43                    Mar-04    26


Where necessary we will explicitly write UniProtKB/Swiss-Prot or
UniProtKB/TrEMBL. This naming scheme allows us to refer to
post-UniProtKB versions of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL
with the same number, starting from version two of UniProtKB (footnote 1).
We can investigate annotation change over time, as complete datasets for
historical versions of UniProtKB and Swiss-Prot are made available by
UniProt on their FTP server, with the exception of Swiss-Prot versions 1-8
and 10, which were never archived. Pre-UniProtKB/TrEMBL releases were
kindly made available to us by UniProt.

Our extraction approach, also summarised in Figure 1, involves four key
steps:

1. The UniProt FTP server (footnote 2) provides complete datasets for past
versions of Swiss-Prot and UniProtKB in flat file format.

2. UniProtKB flat files adhere to a strict structure, as detailed in the
UniProtKB user manual (UniProt Consortium, 2011). A Java framework
was created that allowed UniProtKB comment lines to be correctly
extracted.

3. Over time annotations in UniProtKB have become more structured with
the addition of topic headings (e.g. "subcellular location" and "function").
These headings are ignored by our analysis. We also remove punctuation,
the 'CC' identifier, brackets and whitespace.

4. The final step in this process is to output a list of all words and their
frequency for all annotations in a given database version.
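The four steps above can be sketched in a few lines of Python. This is an illustration only, not the Java framework used in the study: the function name `count_words`, the regular expression for topic headings and the decision to replace every punctuation character with a space are all assumptions.

```python
import re
from collections import Counter

# Assumed shape of a topic heading at the start of a comment block,
# e.g. "-!- FUNCTION: " or "-!- SUBCELLULAR LOCATION: ".
TOPIC_HEADING = re.compile(r"^-!-\s+[A-Z][A-Z ]*:\s*")


def count_words(flat_file_text):
    """Count lower-cased words found in the CC (comment) lines of a flat file."""
    counts = Counter()
    for line in flat_file_text.splitlines():
        if not line.startswith("CC"):
            continue  # step 2: keep comment lines only
        body = line[2:].strip()  # step 3: drop the 'CC' identifier
        body = TOPIC_HEADING.sub("", body)  # step 3: drop the topic heading
        body = re.sub(r"[^\w\s]", " ", body).lower()  # step 3: punctuation, case
        counts.update(body.split())  # step 4: tally each word
    return counts


SAMPLE = """ID   EXAMPLE_ENTRY
CC   -!- DEVELOPMENTAL STAGE: Expressed in the developing eye and brain.
CC   -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC       proteasomal degradation (By similarity).
"""
print(count_words(SAMPLE).most_common(3))
```

The sample lines are the ones shown in Figure 1; the `ID` line is a made-up placeholder that demonstrates non-comment lines being skipped.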

In order to ensure accurate data extraction, we checked that the number
of entries parsed (Figure 1, step 2) matched the number expected from
the release notes. Additionally, the list of headings (comment blocks and
properties) removed (Figure 1, step 3) was noted, along with their frequency,
to ensure only headings described in the UniProtKB manual were removed.
Finally, a random selection of records was manually checked against parsed
outputs (Figure 1, all steps).

2.2 Model fitting 

When developing a framework to model the occurrence of words, a variety 
of competing models were considered. These ranged from relatively simple 
distributions, such as the exponential and log-normal, to more complex 
mixture models. However, the power-law distribution achieved a good balance 
between model parsimony and fit. 

In this paper, we only deal with the discrete power-law distribution (see
Clauset et al., 2009 for a discussion on power-laws). The discrete power-law



1 Version two was the first major release containing Swiss-Prot version 44
and TrEMBL version 27.

2 ftp.uniprot.org/pub/databases/uniprot/ 






An approach to describe and analyse bulk annotation quality 



[Figure 1 panel text. Comment lines extracted from a flat file entry:]

CC   -!- DEVELOPMENTAL STAGE: Expressed in the developing eye and brain.
CC   -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC       proteasomal degradation (By similarity).

[Text retained after processing:]

expressed in the developing eye and brain
proteasomal degradation by similarity

Fig. 1. Outline view of the data extraction process. (1) Initially we
download a complete dataset for a given database version in flat file format. (2)
We then extract the comment lines (lines beginning with 'CC', the comment
indicator). (3) We remove comment blocks and properties (as defined in the
UniProtKB manual (UniProt Consortium, 2011)), punctuation, 'CC', brackets
and make words lower case, so as to treat them as case insensitive. (4) Finally,
we count the individual words and update the total occurrence count of each
word.



distribution has probability mass function

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$$

where

$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

is the generalised or Hurwitz zeta function.
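As a minimal sketch, the probability mass function can be evaluated in plain Python. The helper names `hurwitz_zeta` and `power_law_pmf` are hypothetical, and the Hurwitz zeta function is approximated here by summing the first 200 terms and adding an Euler-Maclaurin estimate of the remainder; this numerical shortcut is an assumption of the sketch, not something described in the paper.

```python
import math


def hurwitz_zeta(alpha, q, terms=200):
    """Approximate sum_{n>=0} (n + q)^(-alpha) for alpha > 1 and q > 0."""
    head = sum((n + q) ** -alpha for n in range(terms))
    b = terms + q
    # Euler-Maclaurin estimate of the remaining terms n >= terms:
    # integral + half of the first omitted term + first derivative correction.
    tail = (b ** (1.0 - alpha) / (alpha - 1.0)
            + 0.5 * b ** -alpha
            + alpha * b ** (-alpha - 1.0) / 12.0)
    return head + tail


def power_law_pmf(x, alpha, x_min):
    """Probability that a word occurs exactly x times, for x >= x_min."""
    if x < x_min:
        return 0.0
    return x ** -alpha / hurwitz_zeta(alpha, x_min)


# Sanity check against a known value: zeta(2, 1) equals pi^2 / 6.
print(hurwitz_zeta(2.0, 1), math.pi ** 2 / 6)
```

The probabilities returned by `power_law_pmf` sum to one over all x greater than or equal to `x_min`, up to the accuracy of the approximation.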

To fit the power-law distribution, we followed the Bayesian paradigm.
We assumed a proper uniform U(1, 5) prior for α. Since the posterior
distribution for the parameters is analytically intractable, we integrate out the
uncertainty using an MCMC algorithm. The parameter space was explored
using a Gaussian random walk. The Markov chain reached equilibrium very
quickly and only a small amount of thinning was necessary.
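A random-walk Metropolis sampler of the kind described above might look like the following Python sketch for a single dataset. It is not the authors' implementation: the function names, step size, starting value, chain length and burn-in are illustrative assumptions, thinning is omitted, and the Hurwitz zeta function is approximated numerically as an assumption of the sketch.

```python
import math
import random


def hurwitz_zeta(alpha, q, terms=200):
    """Approximate sum_{n>=0} (n + q)^(-alpha) for alpha > 1 and q > 0."""
    head = sum((n + q) ** -alpha for n in range(terms))
    b = terms + q
    # Euler-Maclaurin estimate of the remaining terms n >= terms.
    tail = (b ** (1.0 - alpha) / (alpha - 1.0)
            + 0.5 * b ** -alpha
            + alpha * b ** (-alpha - 1.0) / 12.0)
    return head + tail


def log_likelihood(alpha, n_obs, sum_log_x, x_min):
    """Log-likelihood of the discrete power-law for n_obs counts >= x_min."""
    return -alpha * sum_log_x - n_obs * math.log(hurwitz_zeta(alpha, x_min))


def sample_alpha(counts, x_min=50, n_iter=3000, burn_in=500, step=0.05, seed=0):
    """Draw posterior samples of alpha under a uniform U(1, 5) prior."""
    rng = random.Random(seed)
    tail_counts = [x for x in counts if x >= x_min]
    n_obs = len(tail_counts)
    sum_log_x = sum(math.log(x) for x in tail_counts)
    alpha = 2.0  # assumed starting value inside the prior support
    current = log_likelihood(alpha, n_obs, sum_log_x, x_min)
    samples = []
    for i in range(n_iter):
        proposal = alpha + rng.gauss(0.0, step)  # Gaussian random walk
        if 1.0 < proposal < 5.0:  # the uniform prior is zero outside (1, 5)
            candidate = log_likelihood(proposal, n_obs, sum_log_x, x_min)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                alpha, current = proposal, candidate
        if i >= burn_in:
            samples.append(alpha)
    return samples
```

Because the prior is flat on (1, 5), the acceptance ratio reduces to the likelihood ratio, and proposals outside the interval are always rejected. A 95% credible region can then be read off the sorted samples.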

We modelled multiple datasets simultaneously using a "fixed-effects"
approach. Let i denote the dataset of interest, then we aim to infer the
parameter

$$\alpha_i = \alpha + \mu_i$$

where α is the coefficient for a baseline dataset and μ_i is the difference from
this baseline. For example, in Figure 3 the baseline dataset is
UniProtKB/Swiss-Prot version 16 and μ_i represents the change in the α
coefficient from UniProtKB/Swiss-Prot version 16.

Throughout this paper, we set x_min = 50, which we determined using the
BIC criterion when fitting all the datasets in Figure 3. However, the conclusions
are not sensitive to changes of x_min. For smaller values of x_min, the credible
region is reduced since there is more data; conversely, increasing x_min to
around 200 increases the credible regions slightly.

Fitting a power-law yields the exponent of the regression line, represented
by α. Given that the graph and the α value are based on the underlying
text, it is plausible that the α value could provide a measurement to assess the
underlying textual quality. Indeed, it has been previously suggested (Ferrer i
Cancho, 2005) that the α value is related to Zipf's principle of least effort
(Zipf, 1949) (footnote 3). This principle states that it is human nature to take
the path of least effort when achieving a goal. For example, an annotator can
create an annotation with generic terms (least effort for the annotator, more
work for the reader) or with precise and specialist terms (least effort for the
reader, more work for the annotator). Table 2 shows α values that have been
extracted from a variety of texts, which lend confidence to this claim. We can
use this information as a basis for quality; texts which require minimal effort
for the reader, due to expertly curated annotation, are deemed to be of high
quality.

3 Additionally, Zipf has shown that a word's occurrence is inversely
proportional to its rank (Zipf's law). Zipf's law, Pareto's law and power-law
distributions are all types of power-law that are, essentially, different ways of
looking at the same thing (Adamic and Huberman, 2002).

3 RESULTS 

3.1 Does annotation in UniProtKB obey a power-law 
distribution? 

Power-laws have been shown to exist in numerous man-made and
natural phenomena (Clauset et al., 2009). The link between Zipf's
principle of least effort and α was originally based on natural
language. If a power-law distribution is a measure of quality,
we would expect that a power-law distribution is more likely to
occur in human-curated annotation rather than annotations produced
automatically. Therefore, for our initial analysis, we selected
Swiss-Prot as a gold standard resource. Results are shown in Figure 2(a), for
two versions of Swiss-Prot. We can see that annotation does broadly
obey a power-law, although with a distinct "kink" in Swiss-Prot
version 37, between x = 10^4 and x = 10^5.
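The cumulative distributions plotted in Figure 2 give, for each occurrence count x, the probability that a word occurs x or more times. A minimal Python sketch of that calculation follows; the function name `empirical_ccdf` and the toy word counts are hypothetical.

```python
from collections import Counter


def empirical_ccdf(word_counts):
    """Map each observed occurrence count x to P(a word occurs x or more times)."""
    sizes = Counter(word_counts.values())  # number of words occurring exactly x times
    total = sum(sizes.values())
    ccdf = {}
    remaining = total  # words with a count of at least the current x
    for x in sorted(sizes):
        ccdf[x] = remaining / total
        remaining -= sizes[x]
    return ccdf


# Toy example: four distinct words with their occurrence counts.
toy_counts = {"the": 4, "protein": 2, "binds": 2, "kinase": 1}
print(empirical_ccdf(toy_counts))
```

Plotting the resulting pairs on logarithmic axes gives the view used in Figure 2, where a power-law appears as an approximately straight line.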

Inspection of the words in this region showed that this structure is 
artifactual, resulting not from annotation per se but from copyright 
and license information, which are included in the CC lines, along 
with biological information. These copyright statements were first 
introduced into Swiss-Prot at version 37, with wording changes at 
UniProtKB versions 4 and 7. From this analysis, we show that we 
can detect the introduction of a large amount of material with no 
biological significance into the annotation. This demonstrates that 
the power-law can be used as a partial measure of quality, albeit for 
detecting artifacts. 

For subsequent analysis we removed the copyright statements.
Updated graphs, with copyright statements removed, are also shown
in Figure 2(a). Inspection of these graphs shows that the slopes for the
head and tail increase at different rates with Swiss-Prot versions. This
is a result of a marked two-slope behaviour which is commonly seen
for mature resources, such as large complex natural languages (Ha
et al., 2006; Cancho and Sole, 2001). These graphs follow a
power-law distribution reasonably well. However, Figure 2(b) shows some
versions of TrEMBL where a power-law does not fit that well. As
discussed in Section 2.2, various competing models were considered,
with a power-law being chosen partly due to the simplicity of its
output, α. It is clear that this change over time requires further
analysis.

3.2 How do the distributions change over time? 

Although it is useful for demonstrating the two-slope behaviour,
this view makes it difficult to see change over time. Given that the
main analytical value comes from the extracted α values, subsequent
graphs show just these values for different database versions. This
approach allows us to investigate the change over time by looking
at all historical data simultaneously. The results of this
analysis are shown in the top half of Figure 3.

As Figure 3 shows, the annotation in Swiss-Prot is changing in its
nature over time: α decreases over time for Swiss-Prot, suggesting
that Swiss-Prot is becoming optimized toward least effort for the
annotator, rather than the reader. This fits with previous research from
Baumgartner et al. (2007), suggesting that the enormous increase in
proteins requiring annotation is outstripping the provision of this
annotation. This issue has been acknowledged by UniProt, with their
introduction of automated annotation; therefore, we next investigate
this form of annotation.

Table 2. Relationship between α value and Zipf's principle of least effort. For α values less than 1.6 or greater than 2.4, we have no corresponding effort
level as the text is treated as incomprehensible.

α value        Examples in literature                                        Least effort for
α < 1.6        Advanced schizophrenia (Zipf, 1949; Piotrowska and
               Piotrowska, 2004), young children (Piotrowska and
               Piotrowska, 2004; Brillouin, 2004)
1.6 < α < 2    Military combat texts (Piotrowska and Piotrowska, 2004),      Annotator
               Wikipedia (Serrano et al., 2009), Web pages listed on the
               open directory project (Serrano et al., 2009)
α = 2          Single author texts (Balasubrahmanyan and Naranan, 1996)      Equal effort levels
2 < α < 2.4    Multi author texts (Ferrer i Cancho, 2005)                    Audience
α > 2.4        Fragmented discourse schizophrenia (Piotrowska and
               Piotrowska, 2004)
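The bands in Table 2 can be written as a small lookup, sketched here in Python. The function name `least_effort_for` is hypothetical. Two choices are assumptions of the sketch rather than statements from the table: the boundary values 1.6 and 2.4 are assigned to the neighbouring band, and "α = 2" is matched within a small tolerance because an estimated α is never exactly 2.

```python
def least_effort_for(alpha, tol=1e-9):
    """Return the Table 2 effort band for alpha, or None outside 1.6 to 2.4."""
    if alpha < 1.6 or alpha > 2.4:
        return None  # treated as incomprehensible: no effort level assigned
    if abs(alpha - 2.0) <= tol:
        return "equal effort"
    return "annotator" if alpha < 2.0 else "audience"


# Values quoted in the Figure 3 caption: about 2.07 for Swiss-Prot version 9
# and 1.62 for the baseline, UniProtKB/Swiss-Prot version 16.
print(least_effort_for(2.07), least_effort_for(1.62))
```

Read this way, the early release falls in the band favouring the reader and the baseline release in the band favouring the annotator, which matches the trend described in the text.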

3.3 How does manual annotation compare to 
automated annotation? 

Within UniProtKB, proteins are initially annotated automatically and
placed into TrEMBL. Eventually they are manually annotated and
placed into Swiss-Prot. Therefore, TrEMBL and Swiss-Prot are ideal
resources to compare equivalent human and automated annotations.
Here, we compare these two resources, investigating their behaviour
over time, at equivalent points in time. As previously described,
prior to UniProtKB version two, TrEMBL and Swiss-Prot releases
were not synchronized, so we use the version of TrEMBL released
closest in time to each version of Swiss-Prot, as shown in Table 1.
An evenly spaced subset of these analyses is shown in Figure 2(b),
with Figure 3 showing the α values for all versions of TrEMBL and
Swiss-Prot.

In Figure 2(b), TrEMBL and Swiss-Prot appear to diverge over
time with Swiss-Prot demonstrating the behaviour typical of a more
mature resource. TrEMBL shows less maturity, with many words
occurring with a high frequency. In short, Swiss-Prot appears to
show a richer use of vocabulary. We cannot, however, rule out
the possibility that this difference occurs as they are annotating
different proteins. Unfortunately, it is not possible to check a protein's
annotation in both Swiss-Prot and TrEMBL at the same point in
time; once a record is migrated to Swiss-Prot, it is removed from
subsequent versions of TrEMBL. This is necessary as Swiss-Prot is
used as a basis for annotation in TrEMBL, so proteins not removed
from TrEMBL would have their automated annotation based on their
manual annotation in Swiss-Prot. However, the rapid increase in size
of both resources argues against this explanation.

While Swiss-Prot shows a relatively regular progression, TrEMBL
does not. There are two significant disjuncts in the relationship where
large jumps occur between releases, as highlighted in Figure 3. We
also note a significant rise of total words between these versions,
compared to those for nearby releases (data not shown).

We have identified two plausible explanations for these disjuncts,
based on historical events. Firstly, in 1998 (highlighted by disjunct a)
a number of new procedures appear to have been introduced (Bairoch
and Apweiler, 1998). These approaches include making use of the
ENZYME database, specialised genomic databases and scanning
for PROSITE patterns compatible with an entry's taxonomic range.
PROSITE patterns are used to enhance the content of the comment
lines by adding information such as protein function and subcellular
location. Interestingly, prior to this disjunct, the first four versions of
TrEMBL have an α value higher than their Swiss-Prot counterpart.

Secondly, in 2000, the introduction and development of annotation
rules was planned in TrEMBL, which could explain the second jump
(highlighted by disjunct b) (Bairoch and Apweiler, 2000). Both of
these disjuncts would be expected to produce an increase in the total
amount of annotation, as well as introducing new words and phrases
which would affect the measures described here. Given the lack of
detailed statistics and the age of the database at this time, UniProt
could not confirm these explanations. They did acknowledge that
extensive work in 2001 and early 2002 was carried out to improve the
data, although they believe the scanning of PROSITE was in effect
from TrEMBL version 1.

The increase of total words noted earlier correlates with the
increase of entries into UniProtKB; the rate of data being added
is exponential. Given this increase, we find ourselves analysing
entries and annotations of mixed age. Here we have seen the apparent
decrease in quality for complete datasets, of both Swiss-Prot and
TrEMBL, over time. Following on from this, we wish to explore the
quality of annotations within a set of mature entries.



3.4 Analysing maturity of entries over time and the 
impact of new annotations 

Our prior analysis has investigated annotation quality in bulk, without
analysing how individual records are maturing. If we consider
"maturity" as a simple function of age, then we would expect that, given
the rapid increase in the size of Swiss-Prot, the older entries should
mature while new records appear. Figure 5(a) illustrates the exponential
rate at which Swiss-Prot and TrEMBL are growing, showing the
number of entries in each database version.

Each entry contains a date stamp, indicating when it was first
introduced into the database. We use this information to show that the
average creation date of a record has increased only slowly over the
life span of Swiss-Prot as a whole, as illustrated in Figure 5(b).
Swiss-Prot is currently around 20 years old, yet the average record age is
around 5 years old and decreasing. This is illustrated in Figure 5(c),
where we show the difference between the average creation date and
release date. As an example, Swiss-Prot version 9 was released in
November 1988 and the average entry creation date is July 1987, so
the figure reflects this difference of 1 year and 4 months. Given this,
we wished to abstract away from the increasing size of Swiss-Prot and ask
whether individual records appear to be maturing.
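The age calculation behind Figure 5(c) can be sketched in Python as follows. The helper names `mean_date` and `months_between` are hypothetical, and the day of the month is fixed to 1 as an assumption, since the text reports only months and years.

```python
from datetime import date


def mean_date(dates):
    """Average a list of dates by averaging their ordinal day numbers."""
    ordinals = [d.toordinal() for d in dates]
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`, ignoring days."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# Swiss-Prot version 9: released November 1988, average creation date July 1987.
release = date(1988, 11, 1)
average_creation = date(1987, 7, 1)
years, months = divmod(months_between(average_creation, release), 12)
print(years, months)  # 1 4
```

In practice `average_creation` would come from `mean_date` applied to the creation date stamps of every entry in the release.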
Fig. 2. Cumulative distributions of words for various Swiss-Prot and TrEMBL versions, shown with logarithmic scales. The size (number of words) is shown
along the X axis while the probability is shown on the Y axis. A point on the graph represents the probability that a word will occur x or more times. For
example, the upper left most point represents the probability of 1 (i.e. 10^0) that a given word will occur once (i.e. 10^0) or more times. A word must occur at
least once to be included. Words occurring very frequently are presented in the bottom right of the graph. (a) Shows the resulting graphs for Swiss-Prot version 9
(November 1988) and Swiss-Prot version 37 (December 1998), with and without copyright. The distinct structure visible between x = 10^4 and x = 10^5 in
Swiss-Prot version 37 (bottom left panel) is caused by the copyright statement declaration. Swiss-Prot version 9 operates as a control to show that the attempted
removal of copyright has no effect where no copyright information is present. (b) Shows the data with fitted power-law distributions for an even subset of
historical versions of Swiss-Prot and the co-ordinate release of TrEMBL.
Fig. 3. α values over time, for each version of Swiss-Prot and TrEMBL.
The graph shows the difference in α value (with 95% credible region) from
UniProtKB/Swiss-Prot version 16, for which the α value was 1.62. So, for
example, Swiss-Prot version 9 has a difference of approximately 0.45.
Therefore the resulting α for Swiss-Prot version 9 is around 2.07.



This analysis is not straight-forward; we need, essentially, a set of
records which relate to a defined set of proteins. To achieve this we
extracted the annotations from all of the entries that were common
to both Swiss-Prot version 9 and UniProtKB/Swiss-Prot version 15
(the first and last version available to us). The resulting α values are
shown in Figure 4(b), with the addition of the α value for those entries
in UniProtKB version 15 but not Swiss-Prot version 9. These results
show that the α value for the mature set of entries has decreased over
time, correlating with the Swiss-Prot database as a whole.

Given that the α value for mature entries has decreased over time,
it is of interest to investigate the α values of entries that are new to
each version of Swiss-Prot. For this, we extracted annotations from
entries that appeared for the first time in a given database version.
Results of this analysis are shown in Figure 4(a). It again would
appear that the α value is decreasing over time, similar to that of
other Swiss-Prot graphs.
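Selecting the cohorts used in this section amounts to set operations on entry identifiers, as in the following Python sketch. The function name `split_cohorts` and the accession numbers are made up for illustration, and the sketch assumes that an entry keeps the same identifier across releases.

```python
def split_cohorts(old_release, new_release):
    """Partition entry identifiers by their presence in two releases.

    Both arguments map an entry identifier to its annotation text.
    """
    old_ids, new_ids = set(old_release), set(new_release)
    return {
        "mature": old_ids & new_ids,  # present in both releases
        "added_later": new_ids - old_ids,  # only in the newer release
        "removed": old_ids - new_ids,  # only in the older release
    }


# Hypothetical accessions and annotation text, for illustration only.
older = {"P00001": "binds dna", "P00002": "membrane protein"}
newer = {"P00001": "binds dna by similarity", "P00003": "kinase"}
cohorts = split_cohorts(older, newer)
print(sorted(cohorts["mature"]), sorted(cohorts["added_later"]))
```

Word counts for each cohort can then be gathered from the annotation text of the selected identifiers in whichever release is of interest.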



4 DISCUSSION 

The biological community lacks a generic quality metric that allows
biological annotation to be quantitatively assessed and compared.
In this paper we applied power-law distributions to the UniProtKB
database and linked the extracted α values to Zipf's principle of
least effort in an attempt to derive such a generic quality metric. The
results within this paper lend support to our initial hypothesis that
this approach holds promise as a quality metric for textual annotation.

Initially, our analysis focused on the manually-curated Swiss-Prot.
As shown in Figure 3, early versions of Swiss-Prot give α
values that suggest annotations were of high quality; that is, they
were of least effort for the reader. However, over time we see a
steady reduction in the α value, which suggests that the average
annotation is now harder for readers to interpret, requiring more
expertise to consume the data than was previously required. This
result is perhaps best explained by the exponential increase of data
that is added to Swiss-Prot. Manual annotations are regarded as
the highest quality annotation available and it is inevitable that the
acknowledged pressure on manual annotation is going to increase.
This conclusion appears to fit with previous research (Baumgartner
et al., 2007) that shows manual curation techniques cannot keep up
with the increasing rate of data.

Fig. 5. Swiss-Prot (red circles) and TrEMBL (blue triangles). (a) Growth (number of entries) in Swiss-Prot and TrEMBL over time. (b) Average creation date
over time for Swiss-Prot and TrEMBL. (c) Difference between release date and average creation date (i.e. age) over time.

Fig. 4. (a) Analysis of those entries that are new to a particular version of
Swiss-Prot. (b) α value (with 95% credible region) for all entries in
Swiss-Prot version 9 that are in UniProtKB version 15, all entries in UniProtKB
version 15 that are in Swiss-Prot version 9, and all those in UniProtKB version
15, but not in Swiss-Prot version 9.

Many of the patterns exhibited by Swiss-Prot are also shown in our
analysis of TrEMBL. From Figure 3, we conclude that annotation in
Swiss-Prot and TrEMBL show similar characteristics in that, for both
cases, annotation appears to be increasingly optimized to minimize
efforts for the annotator rather than the reader; unsurprisingly, this
appears to be more pronounced for TrEMBL than for Swiss-Prot.
We are currently unclear whether this form of direct numerical
comparison over the two different resources is highly meaningful,
although this distinction between the two resources appears to be
more pronounced over time rather than less. Therefore, these results
are consistent with the conclusion that manual annotation behaves
as a significantly more mature language than automated annotation.
This fits with our a priori assumptions, which again suggests that
this form of analysis is operating as a measure of quality.

In addition to analysing whole UniProtKB datasets, we also
investigated how sets of entries mature over time (Figure 4(b))
and the quality of annotations within new entries (Figure 4(a)).
Within the mature entries we interestingly see a decrease in quality
over time, rather than increasing or maintaining a similar quality
level. However, this decrease is much slower than that of the Swiss-Prot
database as a whole, and the mature entries are still of much higher quality
than the remainder of entries in UniProtKB version 15. For the
new annotations we also see a general decrease in quality over
time. It is plausible that these results stem from the manner of
management and curation of annotations; the UniProt annotation
protocol consists of six key steps (Magrane and UniProt Consortium,
2011), one of which is identifying similar entries (from the same
gene and homologs by using BLAST against UniProtKB, UniRefs
and phylogenomic resources). If two entries from the same gene
and species are identified then they are merged; annotations between
the remaining entries are then standardised. It would appear that
attempts to standardise growing sets of similar entries are having a
detrimental effect on the quality of both individual entries and the
overall database.

In addition to being used as a quality measure, the approach 
described here could be used for artifactual error detection. Our 
early analysis identified information with no biological significance 
(copyright statements) included within the comment lines. 
Our focus in this paper was on UniProtKB, and we have not
tested across other databases. One database of immediate interest
would be InterPro. The work in this paper focuses on protein
annotation; extending this to InterPro would allow us to analyse
protein family annotation, which would normalize for the many near
duplicate records of the large protein families found in UniProtKB.
This would require further bulk analysis; however, once data is
correctly extracted, this form of analysis is straight-forward and
does not require specialist resources. Analysis of other forms of
annotation would also be interesting; Kalankesh et al. (2012) have
recently reported similar results in Gene Ontology annotation.

Further work analysing additional databases would allow us to
draw firmer conclusions regarding the fitting of the power-law,
and consequently the usability of α as a quality metric. However,
our analysis of UniProtKB suggests that this approach holds promise
of being a useful tool for database curators and users alike.